Off-Policy Shaping Ensembles in Reinforcement Learning
Abstract
Recent advances in gradient temporal-difference methods make it possible to learn multiple value functions off-policy and in parallel without sacrificing convergence guarantees or computational efficiency. This opens up new possibilities for sound ensemble techniques in reinforcement learning. In this work, we propose learning an ensemble of policies related through potential-based shaping rewards. The ensemble induces a combination policy by applying a voting mechanism to its components. Learning happens in real time, and we show empirically that the combination policy outperforms the individual policies of the ensemble.
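The two ingredients named in the abstract, potential-based shaping rewards and a voting mechanism over ensemble members, can be illustrated with a minimal sketch. It assumes the standard potential-based shaping term F(s, s') = gamma * phi(s') - phi(s) and a plain majority vote over each member's greedy action. The abstract does not specify the voting rule, so majority voting here is an assumption, and the function names and the toy potential are hypothetical. The off-policy learning of each member's value function is not shown.

```python
from collections import Counter


def shaping_reward(phi, s, s_next, gamma=0.99):
    """Potential-based shaping term F(s, s') = gamma * phi(s') - phi(s).

    phi is a heuristic potential function over states (hypothetical here);
    each ensemble member would add its own F to the environment reward.
    """
    return gamma * phi(s_next) - phi(s)


def greedy_action(q_row):
    """Index of the largest action value (lowest index wins ties)."""
    return max(range(len(q_row)), key=lambda a: q_row[a])


def majority_vote(q_rows):
    """Combine ensemble members by majority vote on their greedy actions.

    q_rows holds one list of action values per ensemble member for the
    current state. Majority voting is an assumed rule, not necessarily
    the one used in the paper. Ties go to the lowest action index.
    """
    votes = Counter(greedy_action(q) for q in q_rows)
    best = max(votes.values())
    return min(a for a, n in votes.items() if n == best)


# Toy usage: three members, three actions; two members prefer action 1.
ensemble_q = [
    [1.0, 2.0, 0.0],
    [0.0, 3.0, 1.0],
    [5.0, 0.0, 0.0],
]
chosen = majority_vote(ensemble_q)

# Toy potential: negative distance to a goal at state 10 (an assumption).
bonus = shaping_reward(lambda s: -abs(10 - s), 3, 4, gamma=1.0)
```

In this toy case the vote selects action 1, and moving one step closer to the goal yields a shaping bonus of 1.0. Other voting rules, such as rank voting, would slot into the same place as `majority_vote`.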
Similar resources
Off-Policy Reward Shaping with Ensembles
Potential-based reward shaping (PBRS) is an effective and popular technique to speed up reinforcement learning by leveraging domain knowledge. While PBRS is proven to always preserve optimal policies, its effect on learning speed is determined by the quality of its potential function, which, in turn, depends on both the underlying heuristic and the scale. Knowing which heuristic will prove effe...
Masters Thesis: Shaping Methods to Accelerate Reinforcement Learning:
Reinforcement learning (RL) is an attractive solution for deriving an optimal control policy by on-line exploration of the control task. In reinforcement learning there is no need to specify how the task is to be achieved. In fact, RL is a way of programming the agents by specifying a reward function. At every time step, the controller (agent) receives the process (environment) state, takes an ...
Ensemble Usage for More Reliable Policy Identification in Reinforcement Learning
Reinforcement learning (RL) methods employing powerful function approximators like neural networks have become an interesting approach for optimal control. Since they learn a policy from observations, they are also applicable when no analytical description of the system is available. Although impressive results have been reported, their handling in practice is still hard, as they can fail at re...
Policy Transfer using Reward Shaping
Transfer learning has proven to be a wildly successful approach for speeding up reinforcement learning. Techniques often use low-level information obtained in the source task to achieve successful transfer in the target task. Yet, a most general transfer approach can only assume access to the output of the learning algorithm in the source task, i.e. the learned policy, enabling transfer irrespe...
On-Policy vs. Off-Policy Updates for Deep Reinforcement Learning
Temporal-difference-based deep reinforcement learning methods have typically been driven by off-policy, bootstrapped Q-learning updates. In this paper, we investigate the effects of using on-policy, Monte Carlo updates. Our empirical results show that for the DDPG algorithm in a continuous action space, mixing on-policy and off-policy update targets exhibits superior performance and stability comp...